Demonstrating 100 Gbps in and out of the public Clouds
There is increased awareness and recognition that public Cloud providers do
provide capabilities not found elsewhere, with elasticity being a major driver.
The value of elastic scaling is, however, tightly coupled to the capabilities of
the networks that connect all involved resources, both in the public Clouds and
at the various research institutions. This paper presents results of
measurements involving file transfers inside public Cloud providers, fetching
data from on-prem resources into public Cloud instances and fetching data from
public Cloud storage into on-prem nodes. The networking of the three major
Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google
Cloud Platform, has been benchmarked. The on-prem nodes were managed by either
the Pacific Research Platform or located at the University of Wisconsin -
Madison. The observed sustained throughput was of the order of 100 Gbps in all
the tests moving data in and out of the public Clouds, with throughput reaching
into the Tbps range for data movements inside the public Cloud providers
themselves. All the tests used HTTP as the transfer protocol.
Comment: 4 pages, 6 figures, 3 tables
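The paper does not reproduce its measurement harness here; a minimal sketch of the kind of parallel-HTTP throughput test it describes might look like the following (all function names, the stream count, and the unit conversion are illustrative, not the paper's code):

```python
import concurrent.futures
import time
import urllib.request

def throughput_gbps(total_bytes, seconds):
    """Convert a byte count and wall-clock time into gigabits per second."""
    return total_bytes * 8 / seconds / 1e9

def fetch(url):
    """Download one object over HTTP and return the number of bytes read."""
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

def benchmark(urls, streams=16):
    """Fetch all URLs over `streams` parallel HTTP connections and report
    the aggregate throughput in Gbps; high per-node rates generally need
    many concurrent streams rather than one large transfer."""
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
        total = sum(pool.map(fetch, urls))
    return throughput_gbps(total, time.monotonic() - start)
```

For scale: sustaining 100 Gbps means moving 12.5 GB of payload every second in aggregate across all streams.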
Defining a canonical unit for accounting purposes
Compute resource providers often put in place batch compute systems to
maximize the utilization of such resources. However, compute nodes in such
clusters, both physical and logical, contain several complementary resources,
with notable examples being CPUs, GPUs, memory and ephemeral storage. User jobs
will typically require more than one such resource, resulting in co-scheduling
trade-offs of partial nodes, especially in multi-user environments. When
accounting for either user billing or scheduling overhead, it is thus important
to consider all such resources together. We thus define the concept of a
threshold-based "canonical unit" that combines several resource types into a
single discrete unit and use it to characterize scheduling overhead and make
resource billing more fair for both resource providers and users. Note that the
exact definition of a canonical unit is not prescribed and may change between
resource providers. Nevertheless, we provide a template and two example
definitions that we consider appropriate in the context of the Open Science
Grid.
Comment: 6 pages, 2 figures, To be published in proceedings of PEARC2
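Since the abstract stresses that the exact definition of a canonical unit is not prescribed, the following is only one plausible reading of a threshold-based definition: a job is charged the largest number of units implied by any single resource it requests. The threshold values below are invented for illustration, not taken from the paper:

```python
import math

# Illustrative per-unit thresholds (not the paper's values): one canonical
# unit corresponds to 1 CPU core, 0.125 of a GPU, and 4 GiB of memory.
THRESHOLDS = {"cpus": 1, "gpus": 0.125, "mem_gib": 4}

def canonical_units(request):
    """Charge a job the maximum over all resource types of
    ceil(requested / threshold), so a GPU-heavy or memory-heavy job
    cannot occupy a large share of a node while being billed for a
    small one."""
    return max(math.ceil(request.get(r, 0) / t) for r, t in THRESHOLDS.items())
```

Under these thresholds a plain 1-core job costs one unit, while a job asking for a whole GPU costs eight, reflecting how much of the node it effectively reserves.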
glideinWMS - A generic pilot-based Workload Management System
Grid resources are distributed among hundreds of independent Grid sites, requiring a higher-level Workload Management System (WMS) to use them efficiently. Pilot jobs have been used for this purpose by many communities, bringing increased reliability, global fair share and just-in-time resource matching. GlideinWMS is a WMS based on the Condor glidein concept, i.e. a regular Condor pool in which the Condor daemons (startds) are started by pilot jobs, while real jobs run as vanilla, standard or MPI universe jobs. GlideinWMS is composed of a set of Glidein Factories, handling the submission of pilot jobs to a set of Grid sites, and a set of VO Frontends, requesting pilot submission based on the status of user jobs. This paper contains a structural overview of glideinWMS as well as a detailed description of the current implementation and the current scalability limits.
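The core of the Frontend/Factory interaction is a pressure calculation: request enough pilots to cover idle user jobs without exceeding per-site limits. A toy sketch of that logic (the parameter names and the exact policy are illustrative, not glideinWMS internals):

```python
def pilots_to_request(idle_jobs, idle_glideins, max_glideins,
                      running_glideins, fraction=1.0):
    """A VO Frontend-style calculation: ask the Glidein Factory for enough
    pilots to cover currently idle user jobs, discounting pilots already
    queued at the site, and capped by the site's remaining headroom."""
    want = int(idle_jobs * fraction) - idle_glideins
    headroom = max_glideins - running_glideins - idle_glideins
    return max(0, min(want, headroom))
```

When demand drops below the number of pilots already queued, the request goes to zero, which is what gives pilot systems their just-in-time matching behavior.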
Porting and optimizing UniFrac for GPUs
UniFrac is a commonly used metric in microbiome research for comparing
microbiome profiles to one another ("beta diversity"). The recently implemented
Striped UniFrac added the capability to split the problem into many independent
subproblems and exhibits near linear scaling. In this paper we describe steps
undertaken in porting and optimizing Striped UniFrac to GPUs. We reduced the
run time of computing UniFrac on the published Earth Microbiome Project dataset
from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla
V100 GPU, and to about one hour on a laptop with NVIDIA GTX 1050 (with minor
loss in precision). Computing UniFrac on a larger dataset containing 113k
samples reduced the run time from over one month on the CPU to less than 2
hours on the V100 and 9 hours on an NVIDIA RTX 2080TI GPU (with minor loss in
precision). This was achieved by using OpenACC for generating the GPU offload
code and by improving the memory access patterns. A BSD-licensed implementation
is available, which produces a C shared library linkable by any programming
language.
Comment: 4 pages, 3 figures, 4 tables
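The "striped" decomposition that makes the problem GPU-friendly can be illustrated on any pairwise metric: stripe k holds the distances d(i, (i+k) mod n) for all i, and each stripe is computed independently. A NumPy sketch using plain Euclidean distance as a stand-in for the UniFrac metric (the stripe layout is the idea from the paper; everything else here is illustrative):

```python
import numpy as np

def pairwise_stripe(data, k):
    """Compute stripe k of an all-pairs distance matrix: d(i, (i+k) mod n)
    for every sample i, as one contiguous, vectorizable pass over memory.
    Independent stripes are what allow near-linear scaling across devices."""
    n = data.shape[0]
    partner = (np.arange(n) + k) % n
    return np.linalg.norm(data - data[partner], axis=1)

def pairwise_full(data):
    """Assemble the full symmetric distance matrix from stripes 1..n//2."""
    n = data.shape[0]
    dist = np.zeros((n, n))
    for k in range(1, n // 2 + 1):
        d = pairwise_stripe(data, k)
        i = np.arange(n)
        j = (i + k) % n
        dist[i, j] = d
        dist[j, i] = d
    return dist
```

Each stripe touches the input in a regular pattern, which is also why improving memory access patterns pays off so much on GPUs.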
Characterizing network paths in and out of the clouds
Commercial Cloud computing is becoming mainstream, with funding agencies
moving beyond prototyping and starting to fund production campaigns, too. An
important aspect of any scientific computing production campaign is data
movement, both incoming and outgoing. And while the performance and cost of VMs
is relatively well understood, the network performance and cost is not. This
paper provides a characterization of networking in various regions of Amazon
Web Services, Microsoft Azure and Google Cloud Platform, both between Cloud
resources and major Data Transfer Nodes (DTNs) in the Pacific Research Platform, including OSG data
federation caches in the network backbone, and inside the clouds themselves.
The paper contains both a qualitative analysis of the results as well as
latency and throughput measurements. It also includes an analysis of the costs
involved with Cloud-based networking.
Comment: 7 pages, 1 figure, 5 tables, to be published in CHEP19 proceedings
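Latency characterization of the kind described can be approximated without special tooling by timing TCP handshakes and summarizing the samples. A minimal sketch (the approach and all names are illustrative; the paper does not publish this harness):

```python
import socket
import statistics
import time

def tcp_rtt_ms(host, port=443, samples=5):
    """Estimate round-trip latency by timing TCP connection setup to a
    remote endpoint; a rough stand-in for dedicated latency probes."""
    rtts = []
    for _ in range(samples):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=5):
            rtts.append((time.monotonic() - start) * 1e3)
    return rtts

def summarize(rtts_ms):
    """Reduce raw samples to a median and spread, the form in which
    per-region latencies are usually compared."""
    return {"median_ms": statistics.median(rtts_ms),
            "stdev_ms": statistics.stdev(rtts_ms) if len(rtts_ms) > 1 else 0.0}
```

The median is preferred over the mean here because a single slow handshake (e.g. a retransmitted SYN) would otherwise dominate the result.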
Running a Pre-Exascale, Geographically Distributed, Multi-Cloud Scientific Simulation
As we approach the Exascale era, it is important to verify that the existing
frameworks and tools will still work at that scale. Moreover, public Cloud
computing has been emerging as a viable solution for both prototyping and
urgent computing. Using the elasticity of the Cloud, we have thus put in place
a pre-exascale HTCondor setup for running a scientific simulation in the Cloud,
with the chosen application being IceCube's photon propagation simulation.
That is, this was not purely a demonstration run; it was also used to produce
valuable and much-needed scientific results for the IceCube collaboration. In
order to reach the desired scale, we aggregated GPU resources across 8 GPU
models from many geographic regions across Amazon Web Services, Microsoft
Azure, and the Google Cloud Platform. Using this setup, we reached a peak of
over 51k GPUs corresponding to almost 380 PFLOP32s, for a total integrated
compute of about 100k GPU hours. In this paper we provide the description of
the setup, the problems that were discovered and overcome, as well as a short
description of the actual science output of the exercise.Comment: 18 pages, 5 figures, 4 tables, to be published in Proceedings of ISC
High Performance 202
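The headline numbers can be sanity-checked with back-of-the-envelope arithmetic; the inputs below come from the abstract, while the per-GPU average is our derivation, not a figure from the paper:

```python
# Values reported in the abstract.
peak_gpus = 51_000        # peak concurrent GPUs
peak_pflops32 = 380       # aggregate fp32 PFLOPS at that peak
gpu_hours = 100_000       # total integrated compute

# Derived: average fp32 throughput per GPU at peak (~7.5 TFLOPS), which is
# plausible for a mix of 8 datacenter and consumer GPU models.
tflops_per_gpu = peak_pflops32 * 1e3 / peak_gpus
```
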
Testing GitHub projects on custom resources using unprivileged Kubernetes runners
GitHub is a popular repository for hosting software projects, both due to
ease of use and the seamless integration with its testing environment. Native
GitHub Actions make it easy for software developers to validate new commits and
have confidence that new code does not introduce major bugs. The freely
available test environments are limited to only a few popular setups but can be
extended with custom Action Runners. Our team had access to a Kubernetes
cluster with GPU accelerators, so we explored the feasibility of automatically
deploying GPU-providing runners there. All available Kubernetes-based setups,
however, require cluster-admin level privileges. To address this problem, we
developed a simple custom setup that operates in a completely unprivileged
manner. In this paper we provide a summary description of the setup and our
experience using it in the context of two Knight lab projects on the Prototype
National Research Platform system.
Comment: 5 pages, 1 figure, To be published in proceedings of PEARC2
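The key property of the setup is that the runner pods request nothing a cluster admin would have to grant: no privileged security context, no host mounts, only ordinary resource limits. A sketch of such a pod manifest as a plain Python dict (the image name, environment variable names, and label scheme are illustrative, not the authors' configuration):

```python
def runner_pod_manifest(name, image, registration_token, labels=("gpu",)):
    """Build an unprivileged Pod manifest for a GitHub Actions runner:
    an explicit non-privileged security context, a GPU resource limit,
    and the registration token passed as an environment variable that
    the runner image consumes at startup."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "runner",
                "image": image,
                "securityContext": {"privileged": False,
                                    "allowPrivilegeEscalation": False},
                "env": [
                    {"name": "RUNNER_TOKEN", "value": registration_token},
                    {"name": "RUNNER_LABELS", "value": ",".join(labels)},
                ],
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }],
        },
    }
```

Because everything is a namespaced Pod object, the whole setup can run under an ordinary user's service account, which is the point of the paper.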
Flexible Session Management in a Distributed Environment
Many secure communication libraries used by distributed systems, such as SSL,
TLS, and Kerberos, fail to make a clear distinction between the authentication,
session, and communication layers. In this paper we introduce CEDAR, the secure
communication library used by the Condor High Throughput Computing software,
and present the advantages to a distributed computing system resulting from
CEDAR's separation of these layers. Regardless of the authentication method
used, CEDAR establishes a secure session key, which has the flexibility to be
used for multiple capabilities. We demonstrate how a layered approach to
security sessions can avoid round-trips and latency inherent in network
authentication. The creation of a distinct session management layer allows for
optimizations to improve scalability by way of delegating sessions to other
components in the system. This session delegation creates a chain of trust that
reduces the overhead of establishing secure connections and enables centralized
enforcement of system-wide security policies. Additionally, secure channels
based upon UDP datagrams are often overlooked by existing libraries; we show
how CEDAR's structure accommodates this as well. As an example of the utility
of this work, we show how the use of delegated security sessions and other
techniques inherent in CEDAR's architecture enables US CMS to meet their
scalability requirements in deploying Condor over large-scale, wide-area grid
systems.
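The separation the abstract describes can be sketched in a few lines: authentication happens once, a session key is derived and cached, and every later message (including UDP datagrams) is authenticated under that key; handing the cached key to another component is the delegation that forms a chain of trust. The derivation scheme below is illustrative, not CEDAR's wire format:

```python
import hashlib
import hmac
import os

def establish_session(master_secret, peer_id):
    """After a one-time authentication handshake, derive a session key that
    both ends cache and reuse, avoiding per-message authentication
    round-trips (key derivation here is a generic HMAC sketch)."""
    nonce = os.urandom(16)
    key = hmac.new(master_secret, peer_id.encode() + nonce,
                   hashlib.sha256).digest()
    return key, nonce

def sign(session_key, payload):
    """Authenticate a message, even a connectionless UDP datagram,
    under the cached session key."""
    return hmac.new(session_key, payload, hashlib.sha256).hexdigest()

def verify(session_key, payload, tag):
    return hmac.compare_digest(sign(session_key, payload), tag)
```

Delegation is then just secure transfer of the cached key: the receiving component can sign messages the original peer accepts without repeating the expensive authentication step.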
Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs
NVIDIA has been the main provider of GPU hardware in HPC systems for over a
decade. Most applications that benefit from GPUs have thus been developed and
optimized for the NVIDIA software stack. Recent exascale HPC systems are,
however, introducing GPUs from other vendors, e.g. with the AMD GPU-based OLCF
Frontier system just becoming available. AMD GPUs cannot be directly accessed
using the NVIDIA software stack, and require a porting effort by the
application developers. This paper provides an overview of our experience
porting and optimizing the CGYRO code, a widely-used fusion simulation tool
based on FORTRAN with OpenACC-based GPU acceleration. While the porting from
the NVIDIA compilers was relatively straightforward using the CRAY compilers on
the AMD systems, the performance optimization required more fine-tuning. In the
optimization effort, we uncovered code sections that had performed well on
NVIDIA GPUs, but were unexpectedly slow on AMD GPUs. After AMD-targeted code
optimizations, performance on AMD GPUs has increased to meet our expectations.
Modest speed improvements were also seen on NVIDIA GPUs, which was an
unexpected benefit of this exercise.
Comment: 6 pages, 4 figures, 2 tables, To be published in Proceedings of PEARC2
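Finding the code sections that were "unexpectedly slow" after a port is usually done with a best-of-N timing harness run per section on both platforms. A generic sketch of such a harness, not CGYRO code:

```python
import time

def time_section(fn, *args, repeats=5):
    """Best-of-N wall-clock time for one code section; taking the minimum
    over repeats filters out warm-up and scheduling noise, so cross-vendor
    comparisons reflect the kernel itself rather than the environment."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Comparing per-section ratios between the two GPU vendors, rather than total runtime, is what isolates the few kernels worth retuning.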